ABSTRACT


The multivariate analysis techniques of cluster 
analysis, principal components analysis, and discriminant 
analysis are examined in this thesis. The theory and 
applications of each of the techniques are discussed. 
Computer software available at the Naval Postgraduate School 
is discussed and sample jobs are included. 

A hierarchical cluster analysis algorithm, available in 
the IMSL software package, is applied to a set of data 
extracted from a group of subjects for the purpose of 
partitioning a collection of 26 attributes of a weapon 
system into six clusters of superattributes. 

A nonhierarchical clustering procedure, principal 
components analysis, and discriminant analysis were all 
applied to a collection of data on tanks consisting of 
twenty-four observations of ten attributes of tanks. The 
cluster analysis shows that the tanks cluster somewhat 
naturally by nationality. The principal components analysis 
and the discriminant analysis show that tank weight is the 
single most important discriminator among nationalities. 
I. DISCUSSION OF MULTIVARIATE DATA ANALYSIS


A. INTRODUCTION 

As a set of statistical techniques, multivariate data 
analysis is concerned with data collected on several 
dimensions of the same observations. The techniques can be 
used for many purposes in the behavioral, mathematical, and 
administrative sciences, ranging from rigidly controlled 
experiments to explain relationships assumed to be present 
in a large mass of data to attempts to cluster similar 
elements or to find functions of the variables that will 
best discriminate among preselected subpopulations of the 
observations. 

The heart of any multivariate analysis consists of the 
data matrix. This matrix is a table that gives the results 
of a number of observations on a number of variables 
simultaneously (Table I). 


Illustrative Data Matrix 

                           Variables 
Observations     1      2      3    ...    j    ...    p 
     1         X_11   X_12   X_13   ...   X_1j  ...   X_1p 
     2         X_21   X_22   X_23   ...   X_2j  ...   X_2p 
     .           .      .      .           .           . 
     i         X_i1   X_i2   X_i3   ...   X_ij  ...   X_ip 
     .           .      .      .           .           . 
     n         X_n1   X_n2   X_n3   ...   X_nj  ...   X_np 

TABLE I. 


The table consists of a set of observations (the n 
rows) and a set of measurements on those observations (the 
p columns). Cell entries represent the value $X_{ij}$ of 
observation i on variable j. The values are 
characteristics of the observations and serve to define the 
observations in any specific study. The cell values may 
consist of nominal, ordinal, interval, or ratio-scaled 
measurements, or various combinations of these across columns. 


In a general sense, "multivariate" analysis concerns 
two main features: 
   1. The multivariate character lies in the 
      multiplicity of the p variables, not 
      in the size n of the set of observations. 
   2. The variables are dependent among themselves, 
      so that we cannot split off one 
      or more from the others and consider it 
      by itself. The variables must be 
      considered together. 


There are three characteristics often used as a basis 
for the classification of multivariate analyses: 

1. whether one's principal focus is on the 
objects or on the variables of the data 
matrix; 

2. whether the data matrix is partitioned 
into criterion and independent subsets, 
and the number of variables in each; 

3. whether the cell values represent 
nominal, ordinal, or interval scale 
measurements. 

This classification results in four major subdivisions of 
interest: 
   1. single criterion, multiple predictor 
      association, including multiple regression, 
      analysis of variance and covariance, and 
      two-group discriminant analysis; 
   2. multiple criterion, multiple predictor 
      association, including canonical correlation, 
      multivariate analysis of variance 
      and covariance, and multiple discriminant 
      analysis; 
   3. analysis of variable interdependence, 
      including factor analysis, multidimensional 
      scaling, and other types of 
      dimension-reducing methods; 
   4. analysis of interobject similarity, 
      including cluster analysis and other 
      types of grouping procedures. 

The first two categories involve dependence structures 
where the data matrix is partitioned into criterion and 
independent subsets; in both cases interest is focused on 
the variables. The last two categories are concerned with 
interdependence - either focusing on variables or on 
observations. Within each of the four categories, various 
techniques are differentiated in terms of the type of scale 
assumed. 

In this research, we consider only the following 
techniques of multivariate analysis: 

1. Principal components analysis 

2. Discriminant analysis 


3. Cluster analysis 


II. PRINCIPAL COMPONENTS ANALYSIS 

The basic idea of principal components analysis is to 
describe the dispersion of an array of n points in 
p-dimensional space by introducing a new set of orthogonal 
linear coordinates so that the sample variances of the 
given data points with respect to these derived coordinates 
are in decreasing order of magnitude. Thus the first 
principal component is such that the projections of the 
given points onto it have maximum variance among all 
possible linear coordinates; the second principal component 
has maximum variance subject to being orthogonal to the 
first; and so on. 

Suppose that the random variables $X_1, X_2, \ldots, X_p$ 
of interest have a certain multivariate distribution 
with finite mean vector $\mu$ and variance-covariance matrix 
$\Sigma$. From this population a sample of n independent 
observation vectors has been drawn. The observations can 
be written as the usual n x p data matrix 

$$ X = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix} = \begin{bmatrix} x_1' \\ \vdots \\ x_n' \end{bmatrix} \tag{1} $$
The estimate of $\Sigma$ will be the usual sample variance- 
covariance matrix S defined as follows: 

$$ S = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})' \tag{2} $$

where $x_i$ is the i-th observation vector and $\bar{x}$ is 
the sample mean vector. 


The information we shall need for our principal components 
analysis will be contained in S. However, it will 
be necessary to make a choice of measures of dependence: 
should we work with the variances and covariances of the 
observations, and carry out the analysis in the original 
units of the responses, or would a more accurate picture of 
the dependence pattern be obtained if each $x_{ij}$ were 
transformed to a standardized score 

$$ z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j} $$

(with $\bar{x}_j$ and $s_j$ the sample mean and standard 
deviation of the j-th response) and the correlation matrix R 
employed? The components obtained from S and R are in 
general not the same, nor is it possible to pass from one 
solution to the other by a simple scaling of the 
coefficients. 

If the responses are in widely different units (e.g., 
number of crew, weight in tons, speed in kilometers per 
hour) with large differences in their magnitudes, 
linear compounds of the original quantities would have 
little meaning, and standardized variates and the 
correlation matrix should be employed. Conversely, if the 
responses are reasonably commensurable, the covariance form 
has a greater statistical appeal, for the i-th principal 
component is that linear compound of the responses which 
explains the i-th largest portion of the total response 
variance, and maximization of such total variance of 
standard scores is rather artificial. 
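
The choice between S and R can be illustrated with a small numerical sketch. The data below are hypothetical (they are not the tank data of this thesis), and the helper name `sorted_eigen` is introduced only for illustration. The sketch computes the eigenvalues and eigenvectors of both matrices so that the two sets of components can be compared directly.

```python
import numpy as np

def sorted_eigen(sym):
    """Eigenvalues in descending order with matching unit eigenvectors (columns)."""
    values, vectors = np.linalg.eigh(sym)
    order = np.argsort(values)[::-1]
    return values[order], vectors[:, order]

# Hypothetical data matrix (n = 6 observations, p = 3 responses):
# crew size, weight in tons, speed in kilometers per hour.
X = np.array([
    [3.0, 36.0, 50.0],
    [4.0, 52.0, 48.0],
    [4.0, 47.0, 42.0],
    [3.0, 40.0, 65.0],
    [4.0, 60.0, 45.0],
    [5.0, 55.0, 40.0],
])

S = np.cov(X, rowvar=False)                        # covariance form, original units
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized scores
R = np.cov(Z, rowvar=False)                        # correlation matrix of X

values_S, vectors_S = sorted_eigen(S)
values_R, vectors_R = sorted_eigen(R)

print("components from S:\n", vectors_S.round(3))
print("components from R:\n", vectors_R.round(3))
```

With responses on such different scales, the components from S would be expected to be dominated by the responses with the largest variances, whereas the eigenvalues of R always sum to p because every standardized response has unit variance.
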

The first principal component of the complex of sample 
values of the responses $X_1, X_2, \ldots, X_p$ is the 
linear compound 

$$ Y_1 = a_{11} X_1 + a_{21} X_2 + \cdots + a_{p1} X_p \tag{3} $$

whose coefficients $a_{i1}$ are the elements of the 
eigenvector associated with the greatest eigenvalue 
$\lambda_1$ of the sample variance-covariance matrix of the 
responses. The $a_{i1}$ are unique up to multiplication by a 
scale factor, and if they are scaled so that 
$A_1' A_1 = 1$, where $A_1' = (a_{11}, \ldots, a_{p1})$, the 
eigenvalue $\lambda_1$ is interpretable as the sample 
variance of $Y_1$. 

Numerically, finding the first principal component amounts 
to finding the vector 

$$ A_1' = (a_{11}, a_{21}, \ldots, a_{p1}) \tag{4} $$

which maximizes the sample variance 

$$ s_{Y_1}^2 = \sum_{i=1}^{p} \sum_{j=1}^{p} a_{i1} a_{j1} s_{ij} = A_1' S A_1 \tag{5} $$

for all coefficient vectors normalized so that $A_1' A_1 = 1$. 


To determine the coefficients, the normalization constraint 
is introduced by means of a Lagrange multiplier $\lambda_1$, 
and the resulting expression is differentiated with respect 
to $A_1$: 

$$ \frac{\partial}{\partial A_1}\left[ s_{Y_1}^2 + \lambda_1 (1 - A_1' A_1) \right] = \frac{\partial}{\partial A_1}\left[ A_1' S A_1 + \lambda_1 (1 - A_1' A_1) \right] = 2 (S - \lambda_1 I) A_1 \tag{6} $$

The coefficients must satisfy the p simultaneous linear 
equations 

$$ (S - \lambda_1 I) A_1 = 0 \tag{7} $$

If the solution to these equations is to be other than the 
null vector, the value of $\lambda_1$ must be chosen so that 

$$ |S - \lambda_1 I| = 0 \tag{8} $$


$\lambda_1$ is thus an eigenvalue of the variance-covariance 
matrix, and $A_1$ is its associated eigenvector. To 
determine which of the p eigenvalues should be used, 
premultiply the system of equations (7) by $A_1'$. Since 
$A_1' A_1 = 1$, it follows that 

$$ \lambda_1 = A_1' S A_1 = s_{Y_1}^2 $$

But the coefficient vector was chosen to maximize this 
variance, and therefore $\lambda_1$ must be the greatest 
eigenvalue of S. 
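
The result that the greatest eigenvalue of S equals the maximized variance can be checked numerically. The following is a minimal sketch on simulated data (the sample, the mixing matrix, and the variable names are assumptions made for illustration): it extracts the eigenvector belonging to the greatest eigenvalue, forms the scores on the first component, and compares their sample variance with the eigenvalue and with the variance along an arbitrary unit vector.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical sample: n = 50 observations on p = 4 correlated responses.
mixing = np.array([
    [2.0, 0.5, 0.0, 0.0],
    [0.0, 1.0, 0.3, 0.0],
    [0.0, 0.0, 0.7, 0.2],
    [0.0, 0.0, 0.0, 0.4],
])
X = rng.normal(size=(50, 4)) @ mixing

S = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(S)   # eigh returns ascending order
lambda_1 = eigenvalues[-1]                       # greatest eigenvalue
A_1 = eigenvectors[:, -1]                        # unit-length eigenvector

Y_1 = (X - X.mean(axis=0)) @ A_1                 # scores on the first component
var_Y_1 = Y_1.var(ddof=1)                        # equals A_1' S A_1

# Variance along an arbitrary unit vector, for comparison.
trial = rng.normal(size=4)
trial /= np.linalg.norm(trial)
var_trial = trial @ S @ trial

print(lambda_1, var_Y_1, var_trial)
```

The first two printed values should agree to rounding error, and the third should not exceed them.
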


The second principal component is that linear compound 

$$ Y_2 = a_{12} X_1 + a_{22} X_2 + \cdots + a_{p2} X_p \tag{9} $$

whose coefficients have been chosen, subject to the 
constraints 

$$ A_2' A_2 = 1, \qquad A_1' A_2 = 0 \tag{10} $$

so that the variance of $Y_2$, $A_2' S A_2$, is a maximum. 
The first constraint is merely a scaling to assure the 
uniqueness of the coefficients, while the second requires 
that $A_1$ and $A_2$ be orthogonal. 

The coefficients of the second component can also be 
found by the Lagrangian technique, with two multipliers 
$\lambda_2$ and $\mu$. Differentiating the resulting 
expression with respect to $A_2$ gives: 

$$ \frac{\partial}{\partial A_2}\left[ A_2' S A_2 + \lambda_2 (1 - A_2' A_2) + \mu A_1' A_2 \right] = 2 (S - \lambda_2 I) A_2 + \mu A_1 \tag{11} $$

If the right-hand side is set equal to 0 and premultiplied 
by $A_1'$, it follows from the normalization and 
orthogonality conditions that 

$$ 2 A_1' S A_2 + \mu = 0 \tag{12} $$

Similar premultiplication of equation (7) by $A_2'$ 
implies that 

$$ A_2' S A_1 = 0 \tag{13} $$

and hence, since S is symmetric, $\mu = 0$. The second 
vector must satisfy 

$$ (S - \lambda_2 I) A_2 = 0 \tag{14} $$


And it follows that the coefficients of the second component 
are thus the elements of the eigenvector corresponding to 
the second greatest eigenvalue. The remaining principal 
components are found in their turn in the same manner from 
the other eigenvectors. 

Thus the j-th principal component of the sample of 
p-variate observations is the linear compound 

$$ Y_j = a_{1j} X_1 + a_{2j} X_2 + \cdots + a_{pj} X_p \tag{15} $$

whose coefficients are the elements of the eigenvector of 
the sample variance-covariance matrix S corresponding to 
the j-th largest eigenvalue $\lambda_j$. If 
$\lambda_i \neq \lambda_j$, the coefficients of the i-th and 
j-th components are necessarily orthogonal; if 
$\lambda_i = \lambda_j$, the elements can be chosen to be 
orthogonal, although an infinity of such orthogonal vectors 
exists. The sample variance of the j-th component is 
$\lambda_j$, and the total system variance is thus 

$$ \lambda_1 + \lambda_2 + \cdots + \lambda_p = \operatorname{tr} S \tag{16} $$

The importance of the j-th component in a more parsimonious 
description of the system is measured by 

$$ \frac{\lambda_j}{\operatorname{tr} S} \tag{17} $$

which gives the fraction of the total variance attributable 
to the j-th component. 
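
As a small worked illustration of equations (16) and (17), the sketch below takes a hypothetical 3 x 3 variance-covariance matrix (the entries are invented for the example) and computes the fraction of total variance attributable to each component, together with the cumulative fraction.

```python
import numpy as np

# Hypothetical sample variance-covariance matrix S for p = 3 responses.
S = np.array([
    [4.0, 2.0, 0.0],
    [2.0, 3.0, 1.0],
    [0.0, 1.0, 2.0],
])

eigenvalues = np.sort(np.linalg.eigvalsh(S))[::-1]   # lambda_1 >= ... >= lambda_p
fractions = eigenvalues / np.trace(S)                # lambda_j / tr S
cumulative = np.cumsum(fractions)

for j, (lam, frac, cum) in enumerate(zip(eigenvalues, fractions, cumulative), start=1):
    print(f"component {j}: eigenvalue {lam:.3f}, fraction {frac:.3f}, cumulative {cum:.3f}")
```

The cumulative column is one common way to decide how many components to retain, although the thesis text itself does not prescribe a cutoff.
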




III. DISCRIMINANT ANALYSIS 


A. INTRODUCTION 

The basic idea of discriminant analysis consists of 
assigning an individual, or a group of individuals, to one 
of several known or unknown distinct populations, on the 
basis of observations on several characters of the 
individual or group and a sample of observations on these 
characters from the populations if these are unknown. 

Fisher (1936) was the first to suggest a linear 
function of variables representing different characters, 
hereafter called the linear discriminant function (discrimi- 
nator) for classifying an individual into one of two popu- 
lations. Later research extended the analysis to classifica- 
tion into one of k populations. 

For the univariate case Fisher suggested a rule which 
classifies an observation x into the i-th univariate 
population if 

$$ |x - \bar{x}_i| = \min\left( |x - \bar{x}_1|, \; |x - \bar{x}_2| \right), \quad i = 1, 2 \tag{18} $$

where $\bar{x}_i$ is the sample mean based on a sample of 
size $N_i$ from the i-th population. For two p-variate 
populations $\pi_1$ and $\pi_2$ (with the same covariance 
matrix) Fisher replaced the vector random variable by an 
optimum linear combination of its components obtained by 
maximizing the ratio of the difference of the expected 
values of a linear combination under $\pi_1$ and $\pi_2$ to 
its standard deviation. He then used his univariate 
discrimination method with this optimum linear combination 
of components as the random variable. 

Rao (1948) considered the problem of classifying 
people into one of three populations (castes) of India. He 
assumed that each of the three populations could be 
characterized by four variables - stature ($x_1$), sitting 
height ($x_2$), nasal depth ($x_3$), and nasal height 
($x_4$) - of each member of the population. On the basis of 
sample observations on these characters from the three 
populations, the problem is to classify an individual with 
observation $x = (x_1, x_2, x_3, x_4)$ into one of the three 
populations. He used a linear discriminator to obtain the 
solution. 


B. THEORY 
In general, the underlying assumptions of discriminant 
analysis are: 
   1. the groups being investigated are discrete 
      and identifiable; 
   2. each observation in each group can be 
      described by a set of measurements on p 
      characteristics or variables; 
   3. these p variables are assumed to have 
      a multivariate normal distribution in each 
      population. 


The purposes of discriminant analysis are: 
   1. to test for mean group differences and to 
      describe the overlaps among groups; 
   2. to construct classification schemes based 
      upon the set of p variables in order to 
      assign previously unclassified observations 
      to the appropriate groups. 
Hence, the problem of studying the direction of group 
differences is, equivalently, a problem of finding a linear 
combination of the original independent variables that 
shows large differences in group means. In short, 
discriminant analysis is a method for determining such 
linear combinations. 
The first step toward determining a linear combination 
of a set of variables such that several group means on this 
linear combination will differ widely among themselves is 
to decide on a criterion for measuring such group-mean 
differences. Once a linear combination has been constructed, 
there is just a single transformed variable. Hence, the 
F-ratio for testing the significance of the overall 
difference among several group means on a single variable 
suggests an appropriate criterion: 

$$ \lambda = \frac{v' B v}{v' W v} \tag{19} $$

where $v' = (v_1, v_2, \ldots, v_p)$ is a set of weights to 
be chosen so as to maximize $\lambda$, and 

$$ B = \sum_{i=1}^{G} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})' $$

$$ W = \sum_{i=1}^{G} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)' $$

   $x_{ij}$ is the j-th observation vector in the i-th group. 

   $\bar{x}_i$ is the mean vector of the i-th group. 

   $\bar{x}$ is the grand mean vector of the data. 

   G is the number of groups. 

   $n_i$ is the number of observations in the i-th group. 

   Prime notation indicates transpose. 
This ratio $\lambda$, called the discriminant criterion, was 
originally proposed by Fisher in connection with his 
two-group discriminant function. Once a criterion for group 
differentiation has been determined, a set of weights 
$(v_1, v_2, \ldots, v_p)$ which maximizes this criterion 
should be determined. This is accomplished by taking the 
partial derivative of $\lambda$ with respect to each 
component $v_i$ of v and setting the result equal to zero: 

$$ \frac{\partial \lambda}{\partial v} = \frac{2\left[ (Bv)(v'Wv) - (v'Bv)(Wv) \right]}{(v'Wv)^2} = \frac{2 (Bv - \lambda W v)}{v'Wv} = 0 \tag{20} $$

which is equivalent to 

$$ (B - \lambda W) v = 0 $$

$$ (W^{-1} B - \lambda I) v = 0 \tag{21} $$

This equation is of the form 

$$ (A - \lambda I) v = 0 \tag{22} $$


Its solution, yielding the eigenvalues $\lambda_m$ and 
associated eigenvectors $v_m$ of the matrix A, is therefore 
obtained in the same way as in principal components 
analysis, and this solution solves the problem of 
maximizing the discriminant criterion. 

In the last equation, the number of nonzero eigenvalues 
of a square matrix A is equal to the rank of A. With 
$W^{-1}B$ playing the role of A, the number of nonzero 
eigenvalues depends on the rank of B, since the rank of the 
product of two matrices cannot exceed the smaller of the two 
factor matrices' ranks, and $W^{-1}$ (being nonsingular) 
must be of full rank p, while the rank of B is usually 
smaller than p. Thus it is possible to denote the rank of B 
by $r = \min(G-1, p)$. 
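
The construction of B and W and the eigenproblem for $W^{-1}B$ can be sketched as follows. The grouped data are simulated and the helper `between_within` is a hypothetical name introduced for this illustration; the real parts of the eigenvalues are taken because a general (nonsymmetric) eigenvalue routine may return negligible imaginary parts for the zero eigenvalues.

```python
import numpy as np

def between_within(groups):
    """Return the between-groups (B) and within-groups (W) SSCP matrices."""
    grand_mean = np.vstack(groups).mean(axis=0)
    p = grand_mean.size
    B = np.zeros((p, p))
    W = np.zeros((p, p))
    for g in groups:
        mean_g = g.mean(axis=0)
        d = (mean_g - grand_mean)[:, None]
        B += g.shape[0] * (d @ d.T)          # n_i (xbar_i - xbar)(xbar_i - xbar)'
        centered = g - mean_g
        W += centered.T @ centered           # sum_j (x_ij - xbar_i)(x_ij - xbar_i)'
    return B, W

rng = np.random.default_rng(2)
# Hypothetical sample: G = 3 groups, p = 4 variables, 8 observations per group.
centers = np.array([[0, 0, 0, 0], [3, 1, 0, 0], [0, 2, 2, 0]], dtype=float)
groups = [rng.normal(size=(8, 4)) + c for c in centers]

B, W = between_within(groups)
eigenvalues, eigenvectors = np.linalg.eig(np.linalg.solve(W, B))   # W^{-1} B
order = np.argsort(eigenvalues.real)[::-1]
lambdas = eigenvalues.real[order]
V = eigenvectors.real[:, order]

r = min(len(groups) - 1, B.shape[0])         # r = min(G - 1, p)
v_1 = V[:, 0]
criterion_1 = (v_1 @ B @ v_1) / (v_1 @ W @ v_1)

print("eigenvalues:", lambdas.round(6))
print("rank of B:", np.linalg.matrix_rank(B), " r:", r)
print("criterion at v_1:", criterion_1)
```

In this setting only r = 2 eigenvalues should be appreciably different from zero, and the discriminant criterion evaluated at the first eigenvector should reproduce the largest eigenvalue.
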

From the fact that the eigenvalues $\lambda_m$ are the 
values assumed by the discriminant criterion for linear 
combinations using the elements of the corresponding 
eigenvectors $v_m$ as combining weights, it is clear that 
the eigenvector $v_1' = (v_{11}, v_{12}, \ldots, v_{1p})$ 
provides a set of weights such that the transformed variable 

$$ Y_1 = v_{11} X_1 + v_{12} X_2 + \cdots + v_{1p} X_p \tag{23} $$

has the largest discriminant criterion, $\lambda_1$, 
achievable by any linear combination of the p independent 
variables. 


What are the properties of the remaining eigenvectors, 
$v_2, v_3, \ldots, v_r$? The second discriminant function 

$$ Y_2 = v_{21} X_1 + v_{22} X_2 + \cdots + v_{2p} X_p $$

whose weights are the elements of the eigenvector $v_2$ 
associated with the second largest eigenvalue $\lambda_2$ of 
$W^{-1}B$, has the largest discriminant criterion among 
those linear combinations of the $X_i$ that are uncorrelated 
with the first discriminant function in the total sample. 
The proof is analogous to that for principal components 
analysis: each discriminant function has a relative (or 
conditional) maximum value for its discriminant criterion. 
Therefore, it remains only to show that $Y_2$ is 
uncorrelated with $Y_1$. Noting that this correlation is 
proportional to $v_1' T v_2$ (where T = W + B), we have to 
prove that $v_1' T v_2 = 0$. We have 

$$ (B - \lambda_i W) v_i = 0 \quad \text{for each } i \tag{24} $$

hence, 

$$ B v_1 = \lambda_1 W v_1 \quad \text{and} \quad B v_2 = \lambda_2 W v_2 $$

Premultiplying these equations by $v_2'$ and $v_1'$ 
respectively, 

$$ v_2' B v_1 = \lambda_1 v_2' W v_1 $$

$$ v_1' B v_2 = \lambda_2 v_1' W v_2 \tag{25} $$

Taking the transpose of both sides of the first equation (B 
and W are symmetric), 

$$ v_1' B v_2 = \lambda_1 v_1' W v_2 $$

thus 

$$ \lambda_1 v_1' W v_2 = \lambda_2 v_1' W v_2 $$

$$ (\lambda_1 - \lambda_2) v_1' W v_2 = 0 $$

and since $\lambda_1 \neq \lambda_2$, $v_1' W v_2 = 0$. By 
(25), $v_1' B v_2 = \lambda_2 v_1' W v_2 = 0$ as well, and 
therefore $v_1' T v_2 = v_1' W v_2 + v_1' B v_2 = 0$, which 
means $Y_1$ and $Y_2$ are uncorrelated. Thus $Y_2$ has this 
property: its discriminant-criterion value, $\lambda_2$, is 
the largest achievable by any linear combination of the X's 
that is uncorrelated (in the total sample) with $Y_1$. 
Similarly 

$$ Y_3 = v_{31} X_1 + v_{32} X_2 + \cdots + v_{3p} X_p \tag{26} $$

has the largest possible discriminant-criterion value 
($\lambda_3$) among all linear combinations of the X's that 
are uncorrelated with both $Y_1$ and $Y_2$; and so on until 
$Y_r$, using the elements of $v_r$ as weights, has the 
largest possible discriminant-criterion value among linear 
combinations that are uncorrelated with all the preceding 
linear combinations $Y_1, Y_2, \ldots, Y_{r-1}$. The linear 
combinations $Y_1, Y_2, \ldots, Y_r$ are called the first, 
second, ..., r-th (linear) discriminant functions for 
optimally differentiating among the G given groups. 
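
The uncorrelatedness argument can be verified numerically. The sketch below uses simulated groups (all data and names are assumptions for illustration), builds W, T, and B = T - W, and checks that the first two discriminant functions satisfy $v_1' W v_2 = 0$ and $v_1' T v_2 = 0$, so that their scores are uncorrelated in the total sample, even though $v_1$ and $v_2$ themselves are generally not orthogonal.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical sample: G = 3 groups of 10 observations on p = 3 variables.
centers = np.array([[0.0, 0.0, 0.0], [2.0, 1.0, 0.0], [0.0, 3.0, 1.0]])
groups = [rng.normal(size=(10, 3)) + c for c in centers]

data = np.vstack(groups)
grand_mean = data.mean(axis=0)
W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
T = (data - grand_mean).T @ (data - grand_mean)      # total SSCP matrix
B = T - W                                            # because T = W + B

eigenvalues, eigenvectors = np.linalg.eig(np.linalg.solve(W, B))
order = np.argsort(eigenvalues.real)[::-1]
v_1 = eigenvectors.real[:, order[0]]
v_2 = eigenvectors.real[:, order[1]]

Y_1 = data @ v_1          # scores on the first discriminant function
Y_2 = data @ v_2          # scores on the second discriminant function
correlation = np.corrcoef(Y_1, Y_2)[0, 1]

print("v_1' W v_2 =", v_1 @ W @ v_2)
print("v_1' T v_2 =", v_1 @ T @ v_2)
print("corr(Y_1, Y_2) =", correlation)
print("v_1' v_2 =", v_1 @ v_2, "(generally nonzero)")
```
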

The situation here is reminiscent of principal components 
analysis. There, the dimension corresponding to the 
first component had maximum variance; the second-component 
dimension had maximum variance among those uncorrelated with 
the first; and so on. In discriminant analysis, the ratio 
of between- to within-groups sums of squares merely takes 
the place of variance as the criterion in determining the 
successive dimensions. However, an important difference 
between the dimensions identified in discriminant analysis 
and those in component analysis is that the former are 
generally not mutually orthogonal in the test space, even 
though they are uncorrelated. That is, the axes representing 
the discriminant functions are not a subset of axes 
obtainable by rigid rotation of the original system of p 
axes; the discriminant rotation is an oblique rotation. 

Just as in principal components analysis, the 
dimensions represented by the discriminant functions may be 
interpreted meaningfully. Even if they are not, it may be 
possible to achieve parsimony by reducing the dimensionality 
of the space needed to describe group differences. In 
seeking to interpret the discriminant functions, the goal is 
to determine which of the original p variables contribute 
most to each function. For this purpose, comparison of the 
relative magnitudes of the combining weights as given by the 
elements of each eigenvector of $W^{-1}B$ is inappropriate 
because these are weights to be applied to the variables in 
raw-score scales, and are hence affected by the particular 
unit used for each variable. 

To eliminate the spurious effects of units of measurement 
on the magnitudes of combining weights, standardized 
variables should be used. 

The relative magnitudes of these standardized weights may 
be assessed by multiplying each raw-score weight by the 
standard deviation of the corresponding variable as computed 
from the within-groups SSCP (sums of squares and cross 
products) matrix. This amounts to multiplying each element 
of a given eigenvector $v_m$ by the square root of the 
corresponding diagonal element of W. Thus, for each m, 
define 

$$ v_{mi}^{*} = \sqrt{w_{ii}} \; v_{mi}, \quad i = 1, 2, \ldots, p \tag{27} $$

as the standardized discriminant weights. The relative 
contribution of the i-th variable to the m-th discriminant 
function may then be gauged by the magnitude of 
$v_{mi}^{*}$ in comparison with the other weights 
$v_{mj}^{*}$. 
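
A short numerical sketch of the standardization step follows. The matrix W and the raw-score weights are invented for the example; they are chosen so that the ordering of the variables by raw weight is the reverse of the ordering by standardized weight, which is exactly the unit-of-measurement effect described above.

```python
import numpy as np

# Hypothetical within-groups SSCP matrix W (p = 3) and raw-score weights v_m.
W = np.array([
    [400.0, 30.0, 2.0],
    [30.0, 25.0, 0.3],
    [2.0, 0.3, 0.04],
])
v_m = np.array([0.05, 0.10, 1.00])

# Standardized weights: each raw weight times sqrt of the matching diagonal of W.
v_star = np.sqrt(np.diag(W)) * v_m

raw_ranking = np.argsort(np.abs(v_m))[::-1]          # ordering by raw weights
std_ranking = np.argsort(np.abs(v_star))[::-1]       # ordering by standardized weights

print("standardized weights:", v_star)
print("ranking by raw weight:", raw_ranking)
print("ranking by standardized weight:", std_ranking)
```

Here the third variable has the largest raw weight only because it is measured on a small scale; after standardization the first variable contributes most.
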


Up to this point, it has been shown that the 
dimensionality of the discriminant space is equal to the 
number of nonzero eigenvalues of $W^{-1}B$, which is the 
smaller of the two numbers G-1 and p. It may often happen 
that the number of significant discriminant dimensions is 
even smaller. That is, not all of the discriminant functions 
may represent dimensions along which statistically 
significant group differences occur. 


C. SIGNIFICANCE TEST IN DISCRIMINANT ANALYSIS 

A basic quantity in testing the significance of the 
overall difference among several group centroids (mean 
vectors) is the ratio of the determinants of the within- 
groups and the total SSCP matrices, known as Wilks' 
$\Lambda$ criterion: 

$$ \Lambda = \frac{|W|}{|T|} \tag{28} $$

Motivation for use of this equation may be seen as follows: 

$$ \frac{1}{\Lambda} = \frac{|T|}{|W|} = |W^{-1} T| = |W^{-1}(W + B)| = |I + W^{-1}B| = (1 + \lambda_1)(1 + \lambda_2) \cdots (1 + \lambda_r) \tag{29} $$

where $\lambda_1, \lambda_2, \ldots, \lambda_r$ are the 
nonzero eigenvalues of $W^{-1}B$. 
Consequently, Bartlett's V statistic for testing the 
significance of an observed value of $\Lambda$ can be 
expressed as 

$$ \begin{aligned} V &= -[N - 1 - (p + G)/2] \ln \Lambda \\ &= [N - 1 - (p + G)/2] \ln \left[ (1 + \lambda_1)(1 + \lambda_2) \cdots (1 + \lambda_r) \right] \\ &= [N - 1 - (p + G)/2] \sum_{m=1}^{r} \ln(1 + \lambda_m) \end{aligned} \tag{30} $$

This statistic is distributed approximately as chi-square 
with p(G-1) degrees of freedom. 

Because of the uncorrelatedness of the successive 
discriminant functions, the successive terms 
$\ln(1 + \lambda_m)$ in the last expression above are 
statistically independent (assuming multivariate normality 
of the original p variables). As a result, the additive 
components of V are each approximately distributed as a 
chi-square variate. More specifically, the m-th component, 

$$ V_m = [N - 1 - (p + G)/2] \ln(1 + \lambda_m) \tag{31} $$

is approximately chi-square with p + G - 2m degrees of 
freedom. It may be readily verified that the sum of the 
numbers of degrees of freedom (n.d.f.) of the r components, 
that is, (p + G - 2) + (p + G - 4) + ... + (p + G - 2r), 
equals r(p + G - r - 1), which is p(G - 1) regardless of 
whether r = G - 1 or r = p. 
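
The decomposition of V into its components can be sketched in a few lines. The eigenvalues below are hypothetical; N = 24 and p = 10 echo the dimensions of the tank data mentioned in the abstract, while G = 4 is an assumption made only for this example. The function name `bartlett_components` is likewise introduced for illustration.

```python
import math

def bartlett_components(eigenvalues, N, p, G):
    """Return Bartlett's V, its additive components V_m, and their degrees of freedom."""
    factor = N - 1 - (p + G) / 2
    components = [factor * math.log(1 + lam) for lam in eigenvalues]
    dfs = [p + G - 2 * m for m in range(1, len(eigenvalues) + 1)]
    return sum(components), components, dfs

# Hypothetical values: r = min(G - 1, p) = 3 nonzero eigenvalues of W^{-1}B.
N, p, G = 24, 10, 4
eigenvalues = [5.2, 1.3, 0.4]

V, components, dfs = bartlett_components(eigenvalues, N, p, G)
wilks_lambda = 1 / math.prod(1 + lam for lam in eigenvalues)   # from equation (29)

print("V =", round(V, 3), " components:", [round(c, 3) for c in components])
print("degrees of freedom:", dfs, " sum:", sum(dfs), " p(G-1):", p * (G - 1))
```

The sum of the component degrees of freedom matches p(G - 1), and V computed from the components agrees with the form based on Wilks' criterion.
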

Consequently, when we cumulatively subtract $V_1$, $V_2$, 
and so on from V, the remainder each time is also a 
chi-square variate; and these successive remainders become 
appropriate statistics for testing whether the residual 
discrimination after removing the first discriminant 
function, the first and second discriminant functions, and 
so forth, is statistically significant. The successive 
test statistics and their n.d.f.'s may be summarized as 
follows: 

Residual After                  Approximate            n.d.f. 
Removing                        Chi-Square Statistic 

First discriminant              V - V_1                p(G-1) - (p+G-2) 
function                                               = (p-1)(G-2) 

First 2 discriminant            V - V_1 - V_2          (p-1)(G-2) - (p+G-4) 
functions                                              = (p-2)(G-3) 

First 3 discriminant            V - V_1 - V_2 - V_3    (p-2)(G-3) - (p+G-6) 
functions                                              = (p-3)(G-4) 

First s discriminant            V - V_1 - ... - V_s    (p-s)(G-(s+1)) 
functions 


As soon as the residual after removing the first s 
discriminant functions becomes smaller than the prescribed 
percentile point (that is, the $100(1 - \alpha)$th 
percentile) of the appropriate chi-square distribution, we 
may conclude that only the first s discriminant functions 
are significant at that $\alpha$ level. If the number of 
significant discriminant functions thus found is smaller 
than r (as will often be the case), we will have effected a 
further reduction in the dimensionality of the space 
required to describe the differences among the G groups 
from which our sample groups were drawn. The remaining r-s 
dimensions may be regarded as immaterial for population 
differentiation, since our sample differences along these 
dimensions can be attributed to sampling error. 
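
The stopping rule just described can be sketched as a short procedure. The eigenvalues, N, p, and G are hypothetical, `residual_tests` is a name introduced for illustration, and the 95th-percentile points are typed in from a standard chi-square table rather than computed, so the sketch needs only the standard library.

```python
import math

def residual_tests(eigenvalues, N, p, G):
    """List (s, residual statistic, n.d.f.) after removing the first s functions."""
    factor = N - 1 - (p + G) / 2
    components = [factor * math.log(1 + lam) for lam in eigenvalues]
    rows = []
    for s in range(len(eigenvalues)):
        residual = sum(components[s:])       # V - V_1 - ... - V_s (s = 0 gives V)
        df = (p - s) * (G - s - 1)
        rows.append((s, residual, df))
    return rows

# 95th percentiles of chi-square for the degrees of freedom that occur below,
# taken from a standard table (alpha = 0.05).
critical_95 = {30: 43.773, 18: 28.869, 8: 15.507}

rows = residual_tests([5.2, 1.3, 0.4], N=24, p=10, G=4)
significant = 0
for s, residual, df in rows:
    if residual < critical_95[df]:
        break                                # residual no longer significant
    significant = s + 1

print(rows)
print("number of significant discriminant functions:", significant)
```

With these invented eigenvalues the overall statistic exceeds its percentile point but the residual after removing the first function does not, so only the first discriminant function would be retained.
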
IV. CLUSTER ANALYSIS


A. ORIGIN AND THEORY 

Clustering is the grouping of similar objects. The 
principal functions of clustering are to name, to display, 
to summarize, to predict, and to aid in interpretation of 
data with many dimensions. Clustering techniques were 
first developed in the field of biological taxonomy. 
Cluster analysis is one of several methodologies included 
in the broader category called classification. 

The cluster analysis problem is the last step we 
consider in the progression of category sorting problems. 
In discriminant analysis some part of the structure 
is known and missing information is estimated from labeled 
samples; the operational objective there is to 
classify new observations, that is, to recognize them as 
members of one category or another. In cluster analysis 
little or nothing is known about the category structure. 
All that is available is a collection of observations whose 
category memberships are unknown. We seek to discover a 
category structure which fits the observations. The problem 
may be stated as one of finding the "natural groups", which 
means to sort the observations into groups such that the 
degree of "natural association" is high among members of 
the same group and low between members of different groups. 
Cluster analysis techniques have been applied in many 
fields of study. The literature is both voluminous and 
diverse, the terminology differing from one field to 
another. "Numerical taxonomy" is frequently substituted 
for cluster analysis among biologists, botanists, and 
ecologists, while some social scientists may refer to 
"typology". Other frequently encountered terms are pattern 
recognition and partitioning. While discriminant analysis 
has been studied by statisticians for nearly 45 years, 
cluster analysis has only recently come to statistical 
notice. Any method which partitions a set of objects into 
subsets on the basis of measurements taken on every object 
qualifies as a clustering method. 

Most of the well-known clustering techniques fall into 
one of two main categories: (1) hierarchical and (2) 
nonhierarchical (partitioning). The former is one in which 
every cluster obtained at any stage is a merger of clusters 
at previous stages. The nonhierarchical procedures, however, 
form new clusters by lumping and splitting old ones. We 
consider both categories shortly. 

In a geometric sense, every observation may be viewed 
as a point in p-dimensional Euclidean space. This swarm of 
data points may contain dense regions or "clouds" of data 
points which are separable from other regions containing a 
low density of points. These denser regions constitute what 
are known as clusters. In one- and two-dimensional cases, it 
is easy to visualize and to detect the clusters from scatter 
plots, assuming that the clusters exist. In higher 
dimensions, clustering becomes extremely difficult without 
the aid of a computer. 

Mathematical clustering techniques usually require a 
measure of similarity to be defined for every pairwise 
combination of the entities to be clustered. In order to 
solve the cluster problem, it is desirable to define the 
terms "similarity" and "difference" in a quantitative 
fashion. A researcher would assign two observations to 
the same group if the distance between them is sufficiently 
small, or to different clusters if this distance is 
sufficiently large. 

At this point, two questions arise. The 
first is "how do we measure the distance between the 
observations?" and the second is "how small is small 
enough, and how large is large enough?" These will be 
discussed in the following sections. 


B. MEASURES OF DISTANCE 
1. General 
Let $E_p$ be a symbolic representation for a 
measurement space of p dimensions and let X, Y, and Z be 
any points in $E_p$. Then any nonnegative real- 
valued function D(X,Y) satisfying the following conditions 
qualifies as a distance function (or metric). 

   1. D(X,Y) = 0 if and only if X = Y 
   2. $D(X,Y) \geq 0$ for all X and Y in $E_p$ 
   3. D(X,Y) = D(Y,X) 
   4. $D(X,Y) \leq D(X,Z) + D(Y,Z)$ 


Many clustering algorithms assume such distances given and 
set about constructing clusters of objects within which the 
distances are small. The choice of distance function is no 
less important than the choice of variables to be used in 
the study. A serious difficulty in choosing a distance lies 
in the fact that a clustering structure is more primitive 
than a distance function and that knowledge of clusters 
changes the choice of distance function. Thus a variable 
that distinguishes well between two established clusters 
should be given more weight in computing distances than a 


"junk" variable that distinguishes badly. 


2. Euclidean Distance 
The Euclidean distance between the I-th and K-th 
observations of a data matrix X is defined as 


1/2 


D(I,K) -| (N(L oy eK, (32) 


z 
l1<J<p 


where J is J-th variable. In one, two, or three 
dimensional space, this is just a "straight line" distance 
between the vectors corresponding to the I-th and K-th 
observations. When the variables are measured in different 
units, it is necessary to prescale the variabes to make 
their values comparable or, equivalently, to compute a 


weighted Euclidean distance. 
D(I,K) = \left[ \sum_{J=1}^{p} W(J) \left( X(I,J) - X(K,J) \right)^2 \right]^{1/2}    (33)

This form of distance is not necessary if all variables are
measured on the same scale. However, even in this case,
weights might be used to increase or decrease the importance
of some variables. Various weighting schemes have been
utilized in practice. One common weighting scheme lets
W(J) be the reciprocal of the variance of variable J.
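
The effect of the reciprocal-variance weights can be illustrated with a short sketch. The data values, the function names, and the choice of the sample variance (divisor n - 1) are assumptions made for illustration only.

```python
import math

def variances(data):
    """Sample variance (divisor n - 1) of each column of a list-of-rows data matrix."""
    n = len(data)
    p = len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    return [sum((row[j] - means[j]) ** 2 for row in data) / (n - 1) for j in range(p)]

def weighted_euclidean(x, y, w=None):
    """Weighted Euclidean distance; with w omitted every weight is 1 (plain Euclidean)."""
    if w is None:
        w = [1.0] * len(x)
    return math.sqrt(sum(wj * (a - b) ** 2 for wj, a, b in zip(w, x, y)))

# Hypothetical data: column 0 in tons, column 1 in meters.
data = [[10.0, 2.0], [50.0, 3.0], [30.0, 2.5], [40.0, 3.5]]
w = [1.0 / v for v in variances(data)]

print(weighted_euclidean(data[0], data[1]))      # dominated by the first column
print(weighted_euclidean(data[0], data[1], w))   # both columns contribute comparably
```

Without weights the distance is driven almost entirely by the variable with the largest numerical spread; with reciprocal-variance weights each variable is expressed in units of its own variability.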

A general class of squared distance functions is
provided by utilizing positive definite quadratic forms.
Specifically, if x represents a p-dimensional observation
to be assigned to one of g groups, then to measure the
squared distance between the observation x and the
centroid (mean vector) \bar{x}_i of the i-th group one may
consider the function

D_i = (x - \bar{x}_i)' M (x - \bar{x}_i)

where M is a positive definite matrix, which ensures that
D_i \geq 0. Different distance functions are represented by
different choices of the matrix M. When M = I (the
identity matrix) the resulting metric is the standard
(squared) Euclidean distance. Distances with the Euclidean
metric are shown in Figure 1a. The variance within the data
may make the unweighted Euclidean metric inappropriate. As
shown in Figure 1b, where X has a larger variance than Y,
one may wish to weight a deviation in the X direction less
than an equal deviation in the Y direction. This gives a
weighted Euclidean distance function which makes points A
and B equidistant from the origin. In this case, M is a
diagonal matrix whose elements are the reciprocals of
the variances of the different variables.

Extending this idea further, it is possible to
consider the covariance among variables as well. Figure 1c
shows how the axes may be rotated so that the major axis is
oriented in a direction reflecting the positive correlation
between X and Y. Again, points on the same
ellipse are considered equidistant from the origin. The
matrix M in this case is the inverse of the covariance
matrix.
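
The three choices of M discussed so far (identity, reciprocal variances, inverse covariance) can be compared on a two-variable example. This is a sketch only; the covariance matrix and the two deviation vectors are hypothetical, and the helper names are not from the original programs.

```python
def quad_form(d, M):
    """Return d' M d for a 2-vector d and a 2 x 2 matrix M."""
    return sum(d[i] * M[i][j] * d[j] for i in range(2) for j in range(2))

def inverse_2x2(A):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

# Hypothetical covariance matrix: X has the larger variance and
# X and Y are positively correlated.
cov = [[4.0, 1.2], [1.2, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
diag_recip = [[1.0 / cov[0][0], 0.0], [0.0, 1.0 / cov[1][1]]]
cov_inv = inverse_2x2(cov)

a = [2.0, 0.0]   # deviation from the centroid along X
b = [0.0, 1.0]   # deviation from the centroid along Y
for name, M in [("identity", identity),
                ("reciprocal variances", diag_recip),
                ("inverse covariance", cov_inv)]:
    print(name, quad_form(a, M), quad_form(b, M))
```

Under the identity matrix the deviation along X counts four times as much as the deviation along Y; once M contains the reciprocal variances the two deviations are equidistant from the centroid, which is the situation sketched in Figure 1b.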

A further extension of this concept leads to a
generalized distance function. If C_i represents the
covariance matrix of the i-th cluster, then the distance
function

D_i = (x - \bar{x}_i)' C_i^{-1} (x - \bar{x}_i)

uses the appropriate covariance structure when determining
the distance to a particular cluster centroid. Since C_i
changes to reflect the dispersion internal to each particular
cluster, the use of this metric exploits differences in the
dispersion characteristics of the different groups. As shown
in Figure 1d, note how a new observation (denoted by u) is
closer to the centroid of group one (G1) in terms of
Euclidean distance but is more likely to be assigned to
group two (G2) when using the C_i matrix.

Figure 1. Euclidean Distance
(1a. Euclidean measure of squared distance. 1b. Measure of
squared distance with different weights for variables.
1c. Generalized squared distance measure. 1d. Classification
when within-group dispersions are different.)


3. Mahalanobis Distance 
Another choice for the M matrix in the quadratic form
above is P^{-1}, where P represents the pooled within-groups
covariance matrix of all the clusters:

P = \frac{1}{N - g} W    (34)

where

W = \sum_{k=1}^{g} W_k

This distance is the well-known Mahalanobis distance. Note
that P does not change from group to group. To ensure
the non-singularity of P it must be true that p \leq N - g,
where N represents the total number of observations over
all groups. Rewriting the distance,


D_i = (x - \bar{x}_i)' P^{-1} (x - \bar{x}_i) = (N - g) (x - \bar{x}_i)' W^{-1} (x - \bar{x}_i)    (35)

defines a distance between the vectors x and \bar{x}_i under
the common covariance structure. The Mahalanobis distance
function adjusts for both the scale of measurement of the
variables and the covariation among the variables. Use of this
metric is equivalent to computing Euclidean distances on
variables transformed to their principal components, each
scaled to unit variance. This metric is invariant under any
nonsingular transformation of the original variables. For
consider the transformation

Y = BX    (36)

and let D(Y_i, Y_j) represent the Mahalanobis distance between
Y_i and Y_j. Since the covariance matrix of the transformed
variables is P_Y = B P_X B',

D(Y_i, Y_j) = (Y_i - Y_j)' P_Y^{-1} (Y_i - Y_j)
            = (BX_i - BX_j)' P_Y^{-1} (BX_i - BX_j)
            = (X_i - X_j)' B' (B P_X B')^{-1} B (X_i - X_j)
            = (X_i - X_j)' P_X^{-1} (X_i - X_j)
            = D(X_i, X_j)
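
The invariance of the Mahalanobis distance under a nonsingular transformation Y = BX can also be checked numerically. The sketch below is illustrative only: it works in two dimensions, uses hypothetical data, and substitutes the ordinary sample covariance matrix of a single group for the pooled matrix P.

```python
def covariance(vs):
    """2 x 2 sample covariance matrix of a list of 2-vectors."""
    n = len(vs)
    m = [sum(v[i] for v in vs) / n for i in range(2)]
    c = [[0.0, 0.0], [0.0, 0.0]]
    for v in vs:
        for i in range(2):
            for j in range(2):
                c[i][j] += (v[i] - m[i]) * (v[j] - m[j]) / (n - 1)
    return c

def mahalanobis_sq(u, v, cov):
    """Squared Mahalanobis distance (u - v)' cov^{-1} (u - v) in two dimensions."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    d = [u[0] - v[0], u[1] - v[1]]
    return sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))

def transform(B, v):
    return [B[0][0] * v[0] + B[0][1] * v[1], B[1][0] * v[0] + B[1][1] * v[1]]

# Hypothetical observations and an arbitrary nonsingular matrix B (det = 5.5).
X = [[1.0, 2.0], [2.0, 1.0], [4.0, 5.0], [6.0, 4.0], [7.0, 8.0]]
B = [[2.0, 1.0], [0.5, 3.0]]
Y = [transform(B, x) for x in X]

d_x = mahalanobis_sq(X[0], X[3], covariance(X))
d_y = mahalanobis_sq(Y[0], Y[3], covariance(Y))
print(d_x, d_y)   # the two values agree up to rounding error
```

The same comparison made with plain Euclidean distance would generally give different values before and after the transformation.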
Some other common metrics are listed below:

1. L_1 norm (City Block)

D(X_i, X_j) = \sum_{k=1}^{p} | X_{ik} - X_{jk} |

2. L_r norm (Minkowski Metrics)

D(X_i, X_j) = \left[ \sum_{k=1}^{p} | X_{ik} - X_{jk} |^{r} \right]^{1/r}

3. Uniform norm

D(X_i, X_j) = \max_{1 \leq k \leq p} | X_{ik} - X_{jk} |
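
The three metrics listed above can be written in a few lines. This is a minimal sketch with hypothetical function names and data; the exponent of the Minkowski metric is passed as the parameter r.

```python
def city_block(x, y):
    """L_1 norm: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, r):
    """L_r norm; r = 1 gives the city block metric, r = 2 the Euclidean metric."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

def uniform(x, y):
    """Uniform norm: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (1.0, 5.0, 2.0), (4.0, 1.0, 2.0)
print(city_block(x, y))     # 7.0
print(minkowski(x, y, 2))   # 5.0
print(uniform(x, y))        # 4.0
```

As r grows, the Minkowski distance is increasingly dominated by the largest coordinate difference and approaches the uniform norm.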
C. HIERARCHICAL CLUSTERING

1. General

The previously discussed distance measures may be
used to construct a similarity matrix describing the strength
of all pairwise relationships among the entities (variables
or data units) in the data set. The methods of hierarchical
cluster analysis operate on this similarity matrix to
construct a tree depicting specified relationships among the
entities. As shown in Figure 2, the branches on the left
each represent one entity while the root represents the
entire collection of entities. Moving through the tree
from the branches toward the root depicts increasing
aggregation of the entities into clusters. Hierarchical
clustering methods which build a tree from branches to
root are often called agglomerative methods; methods which
work from the root toward the branches are called divisive.

Once a tree is constructed for N entities, the
analyst may choose from as many as N sets of clusters.
These clusters are nested. From the agglomerative view,
when two entities are merged they are joined together
permanently and considered as one entity for later merges;
from the divisive view, when a group of entities is split into
two parts, the parts are separated permanently and may be
treated independently for the remainder of the analysis.
Figure 2. Tree for Hierarchical Clustering

Herein lie both the strength and the weakness of
hierarchical methods: by taking early decisions as
permanent, the number of possibilities that need be examined
is reduced greatly as compared with complete enumeration;
but this same convention precludes correcting early
mistakes or capitalizing on later opportunities.

There are three major hierarchical clustering concepts:

1. Linkage methods
2. Centroid methods
3. Error sum of squares or variance methods

All of these methods are suitable for clustering data units.
However, only the linkage methods are considered in this
research.
2. The General Agglomerative Procedure
Let s_ij be the similarity between entities i
and j as defined by one of the distance measures previously
discussed. Assuming that the similarity is symmetric, the
complete schedule of similarities for all \binom{N}{2} = \frac{1}{2} N(N - 1)
possible pairwise combinations of entities may be arrayed
in a lower triangular similarity matrix as in Figure 3.
The s_ij entries are nonnegative. This limitation is of
consequence only for correlation and the cosine of the angle
between vectors; the distinction between positive and
negative association cannot be utilized in these clustering
methods.

        s_21
        s_31  s_32
S  =    s_41  s_42  s_43
         .     .     .
        s_N1  s_N2  s_N3  ...  s_N,N-1

Figure 3. Lower Triangular Similarity Matrix


A simple remedy is to use the absolute value or the square of
the measure if it can assume negative values. Once the
matrix is defined, the process of clustering entities is
almost trivially simple. The general procedure for
agglomerative clustering is as follows:

(1) Begin with N clusters, each consisting of
    exactly one entity. Let the clusters be
    labeled with the numbers 1 through N.

(2) Search the similarity matrix for the most
    similar pair of clusters. Let the chosen
    clusters be labeled p and q and let
    their associated similarity be s_pq, p > q.

(3) Reduce the number of clusters by 1 through
    merger of clusters p and q. Label the
    product of the merger q and update the
    similarity matrix entries in order to
    reflect the revised similarities between
    cluster q and all other existing clusters.
    Delete the row and column of S pertaining
    to cluster p.

(4) Perform steps 2 and 3 a total of N-1 times
    (at which point all entities will be in one
    cluster). At each step record the identity
    of the clusters which are merged and the value
    of similarity between them in order to have
    a complete record of the results.


Different agglomerative methods are implemented by
varying the procedures used for defining the most similar
pair at step 2 and for updating the revised similarity
matrix at step 3. The similarity matrix is a given array
of numbers. The numerical execution of the clustering
procedures is completely independent of how the similarity
values were generated or whether the entities to be
clustered are variables or data units. However, it is
necessary to make a direct distinction between distance-like
measures (the smallest values correspond to the most similar
pairs) and correlation-like measures (the largest values
correspond to the most similar pairs); the essential
difference is whether the search for the most similar pair
involves seeking the minimum or maximum entry in the
similarity matrix.
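
The four steps of the general procedure can be sketched as follows for a distance-like measure. This is an illustration under stated assumptions, not the program of Appendix A: the function name `agglomerate`, the dictionary representation of the lower triangular matrix, and the sample distances are all hypothetical, and the rule for updating the matrix is passed in as a parameter.

```python
def agglomerate(dist, update=min):
    """Agglomerative clustering on a distance-like matrix.

    dist   -- symmetric N x N list of lists of distances
    update -- rule giving the distance from the merged cluster to another
              cluster (min and max are used here purely as illustrations)
    Returns the merge history as (p, q, distance) tuples with labels 1..N.
    """
    n = len(dist)
    # Lower triangle stored under keys (larger label, smaller label).
    s = {(i, j): dist[i - 1][j - 1] for i in range(2, n + 1) for j in range(1, i)}
    active = set(range(1, n + 1))
    history = []
    for _ in range(n - 1):
        # Step 2: the most similar pair is the smallest entry (p > q by construction).
        (p, q), d = min(s.items(), key=lambda item: item[1])
        history.append((p, q, d))
        # Step 3: merge p into q and revise the entries involving q.
        active.remove(p)
        for r in active:
            if r == q:
                continue
            key_q = (max(q, r), min(q, r))
            key_p = (max(p, r), min(p, r))
            s[key_q] = update(s[key_p], s[key_q])
        # Delete the row and column pertaining to cluster p.
        s = {k: v for k, v in s.items() if p not in k}
    return history

# Hypothetical distances among four entities.
d = [[0, 2, 6, 10],
     [2, 0, 5, 9],
     [6, 5, 0, 4],
     [10, 9, 4, 0]]
print(agglomerate(d))
print(agglomerate(d, update=max))
```

For a correlation-like measure the search in step 2 would seek the maximum entry instead of the minimum. Only the `update` argument changes from one linkage method to another; the remainder of the procedure is common to all of them.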


3. Single Linkage 


The method of single-linkage cluster analysis is
the simplest of all hierarchical techniques. At each stage,
after clusters p and q have been merged, the similarity
between the new cluster (labeled t) and some other cluster r
is determined as follows:

1. If s_ij is a distance-like measure

s_{tr} = \min (s_{pr}, s_{qr})    (37)

The quantity s_tr is the distance between the two closest
members of clusters t and r. If clusters t and r
were to be merged, then for any entity in the resulting
cluster the distance to its nearest neighbor would be at
most s_tr.

2. If s_ij is a correlation-like measure

s_{tr} = \max (s_{pr}, s_{qr})    (38)

The quantity s_tr is the similarity between the two most
similar entities in clusters t and r. If clusters t
and r were to be merged, then for any entity in the
resulting cluster there would be at least one other entity
in the same cluster such that the pair would have a similarity
at least as large as s_tr.

The method is known as single linkage because clusters
are joined at each stage by the single shortest or strongest
link between them. Since the updating process involves
only choosing the minimum or maximum, single-linkage
clustering is invariant to any transformation which leaves
the ordering of the similarities unchanged; that is, any
monotonic transformation.


4. Complete Linkage 
The complete-linkage method is related to the single-
linkage method and is no more difficult to execute. At each
stage, after clusters p and q have been merged, the
similarity between the new cluster (labeled t) and some
other cluster r is determined as follows:

1. If s_ij is a distance-like measure

s_{tr} = \max (s_{pr}, s_{qr})    (39)

The quantity s_tr is the distance between the most distant
members of clusters t and r. If clusters t and r
were merged, then every entity in the resulting cluster
would be no farther than s_tr from every other entity in
the cluster. The value of s_tr is the diameter of the
smallest sphere which can enclose the cluster resulting from
the merger of clusters t and r.

2. If s_ij is a correlation-like measure

s_{tr} = \min (s_{pr}, s_{qr})    (40)

The quantity s_tr is the similarity between the two most
dissimilar entities in clusters t and r. If clusters
t and r were to be merged, then every entity in the
resulting cluster would have a similarity of at least s_tr
with every other entity in the cluster.

The method is called complete linkage because all
entities in a cluster are linked to each other at some
maximum distance or minimum similarity. Such a cluster
is called a "maximally connected subgraph" in graph theory.
In contrast to the single-linkage method, interpretation
of the clusters can be made only in terms of the
relationships within individual clusters; there is no
particularly useful interpretation involving the differences
between clusters. Like the single-linkage method,
complete-linkage cluster analysis is invariant to monotonic
transformations of the similarity measure. Johnson (1967)
discusses this property in both single and complete linkage
methods.
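
The four updating rules of equations (37) through (40) differ only in whether the minimum or the maximum of the two old similarities is kept. The sketch below collects them in one function; the function name and its arguments are illustrative assumptions.

```python
def merged_similarity(s_pr, s_qr, method, distance_like=True):
    """Similarity between the merged cluster t = (p, q) and another cluster r.

    method        -- "single" or "complete"
    distance_like -- True if small values mean similar, False for
                     correlation-like measures (large values mean similar)
    """
    if method not in ("single", "complete"):
        raise ValueError("method must be 'single' or 'complete'")
    # Single linkage keeps the closest link, complete linkage the farthest one.
    keep_closest = (method == "single")
    # For a distance-like measure the closest link is the minimum;
    # for a correlation-like measure it is the maximum.
    use_min = (keep_closest == distance_like)
    return min(s_pr, s_qr) if use_min else max(s_pr, s_qr)

print(merged_similarity(5, 6, "single"))                              # 5, as in (37)
print(merged_similarity(0.8, 0.3, "single", distance_like=False))     # 0.8, as in (38)
print(merged_similarity(5, 6, "complete"))                            # 6, as in (39)
print(merged_similarity(0.8, 0.3, "complete", distance_like=False))   # 0.3, as in (40)
```

Because each rule only selects one of the two existing values, applying any order-preserving transformation to the similarities leaves the sequence of merges unchanged, which is the invariance property noted above.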


D. NONHIERARCHICAL CLUSTERING 

Nonhierarchical clustering methods are designed to
cluster data units into a single classification of g
clusters, where g either is specified a priori or is
determined as a part of the clustering method. The central
idea in most of these methods is to choose some initial
partition of the data units and then alter cluster
memberships so as to obtain a better partition. The various
algorithms which have been proposed differ as to what
constitutes a "better partition" and what methods may be used
for achieving improvements.

The broad concept for these methods is very similar
to that underlying the steepest descent algorithms used
for unconstrained optimization in nonlinear programming.
Such algorithms begin with an initial point and then
converge to a local optimum, moving one step at a time,
with the value of the objective function improving at each
step.

The methods of nonhierarchical clustering typically
may be used with much larger problems than the hierarchical
methods because it is not necessary to calculate and store
the similarity matrix; it is not even necessary to store
the data set. In general, the data units are processed
serially and can be read from tape or disk as needed. This
characteristic makes it possible, at least in principle, to
cluster arbitrarily large collections of data units.

In this research, we consider only the partitioning
method known as "K-MEANS", which was developed by MacQueen
(15). He used the term "K-MEANS" to denote the process of
assigning each data unit to that cluster (of k clusters)
with the nearest centroid (mean vector). The cluster
centroid changes with each transfer of an observation.
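
A single serial pass of this kind can be sketched as follows. The code is an illustration, not MacQueen's or McRae's program; it assumes that the first k observations seed the clusters and that squared Euclidean distance is used, and the data are hypothetical.

```python
def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def one_pass_kmeans(data, k):
    """Assign each observation to the nearest centroid, updating that
    centroid immediately after every assignment."""
    # Assumption for this sketch: the first k observations seed the clusters.
    centroids = [list(row) for row in data[:k]]
    counts = [1] * k
    labels = list(range(k))
    for row in data[k:]:
        c = min(range(k), key=lambda i: sq_dist(row, centroids[i]))
        counts[c] += 1
        # Running-mean update: new mean = old mean + (x - old mean) / n.
        centroids[c] = [m + (x - m) / counts[c] for m, x in zip(centroids[c], row)]
        labels.append(c)
    return labels, centroids

# Hypothetical observations forming two well-separated groups.
data = [[1.0, 1.0], [9.0, 9.0], [1.0, 2.0], [8.0, 9.0], [2.0, 1.0], [9.0, 8.0]]
labels, centroids = one_pass_kmeans(data, 2)
print(labels)
print(centroids)
```

Only the current centroids and cluster counts are held in memory, so the observations could equally be read one at a time from a file, which is the property that allows very large data sets to be processed.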

The decomposition of the total scatter matrix T into
within-groups and between-groups matrices, W and B, suggests
possible optimality criteria to be used in a clustering
algorithm. One would like the within-groups scatter to be
small relative to the between-groups scatter. Various trial
clusterings could be formed using the W and B matrices
as a basis for the optimality criteria which determine the
best clustering. A possible choice for a criterion is to
minimize trace W over all partitions into g groups.
Since T is constant over all partitions, minimizing trace
W is equivalent to maximizing trace B, since

trace T = trace W + trace B    (41)

Although trace W is invariant under an orthogonal
transformation, it is not invariant under other non-singular
linear transformations.
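
Equation (41) can be verified numerically, since the trace of a scatter matrix is simply a sum of squared deviations. The following sketch uses a hypothetical partition of six observations into two groups; the function names are illustrative.

```python
def trace_scatter(rows, center):
    """Trace of the scatter matrix about `center`: the sum of squared deviations."""
    return sum((x - c) ** 2 for row in rows for x, c in zip(row, center))

def centroid(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

# Hypothetical partition of six two-dimensional observations into g = 2 groups.
groups = [
    [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0]],
    [[9.0, 9.0], [8.0, 9.0], [9.0, 8.0]],
]
all_rows = [row for grp in groups for row in grp]
grand = centroid(all_rows)

trace_T = trace_scatter(all_rows, grand)
trace_W = sum(trace_scatter(grp, centroid(grp)) for grp in groups)
trace_B = sum(len(grp) * trace_scatter([centroid(grp)], grand) for grp in groups)

print(trace_T, trace_W + trace_B)   # the two values agree up to rounding
```

Because trace T does not depend on how the observations are grouped, any regrouping that lowers trace W must raise trace B by the same amount.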

McRae (16) points out that trace W equals the total 
within group sum of squares, hence the "minimum variance 
partition" cluster solution is found by minimizing trace W. 

Considerable study has been devoted to alternative
criteria such as those based on multivariate statistical
analysis techniques, especially the methods of linear
discriminant analysis and multivariate analysis of variance.
Assuming the p variables are not linearly dependent, then
as long as p \leq N - g, W is positive definite symmetric
and so is W^{-1}. Attempts to make B and W as different
as possible lead one to solving the determinantal equation:

|B - \lambda W| = 0    (42)
The solutions \lambda_i are the eigenvalues of the matrix W^{-1}B,
as in discriminant analysis. There are t non-zero
eigenvalues, where t is the minimum of p and g-1.
This is a consequence of the fact that, if g is less than
p, the g group means are contained in a (g-1)-dimensional
hyperplane. When g = 2 the analysis is equivalent to
two-group discriminant analysis. Linear discriminant
analysis would take the vectors originally described in a
p-dimensional coordinate system and transform the basis to a
t-dimensional system. Maximizing the largest of these
eigenvalues is a criterion suggested by S.N. Roy;
maximizing the trace of W^{-1}B, however, is a criterion
suggested by Hotelling. In both cases, large values for
these statistics are sought in clustering algorithms since
large values indicate large differences among (between)
groups. Minimizing the ratio of determinants |W| / |T|
is a criterion widely known as Wilks' lambda, discussed in
the discriminant analysis. Since T is the same for all
partitions, this criterion is equivalent to minimizing
determinant W. Both trace W^{-1}B and |T| / |W| may
be expressed in terms of the eigenvalues of W^{-1}B:

\frac{|T|}{|W|} = \prod_{i=1}^{t} (1 + \lambda_i)    (43)

\operatorname{trace} W^{-1}B = \sum_{i=1}^{t} \lambda_i    (44)

where t = min(p, g-1). Therefore minimizing det W is
equivalent to maximizing \prod (1 + \lambda_i).
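
The step from the scatter decomposition to equations (43) and (44) can be written out as follows. This is a short worked derivation added for clarity; it assumes T = W + B and that W is nonsingular, and it uses the fact that the p - t remaining eigenvalues of W^{-1}B are zero.

```latex
\begin{aligned}
|T| &= |W + B| = |W|\,\bigl|I + W^{-1}B\bigr|, \\
\frac{|T|}{|W|} &= \bigl|I + W^{-1}B\bigr|
   = \prod_{i=1}^{p} (1 + \lambda_i)
   = \prod_{i=1}^{t} (1 + \lambda_i), \\
\operatorname{trace} W^{-1}B &= \sum_{i=1}^{p} \lambda_i
   = \sum_{i=1}^{t} \lambda_i .
\end{aligned}
```

The second line uses the fact that the eigenvalues of I + W^{-1}B are 1 + \lambda_i, so that its determinant is the product of those values.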

Friedman and Rubin (6) describe the advantages of the
various criteria. Those based on multivariate statistical
considerations (all but trace W) are invariant under
changes in scale for variables (non-singular linear
transformations). In fact, they are the only invariants for W
and B under such transformations. In addition, the
multivariate criteria may take into account covariation
among the variables.
V. ANALYSIS OF MULTIVARIATE UTILITY DATA


To illustrate hierarchical clustering we applied the
technique described in the previous chapter to partition a
set of twenty-six attributes of a close-air support weapon
system into a smaller collection of "superattributes". As
part of an effort to evaluate the military utility of a
proposed alternative U.S. Marine Corps air support radar
system, the AN-TPQ/27, Barr and Richards (4) extracted 26
attributes of the TPQ-27 and a baseline system, the AN-TPQ/10,
and then had members of the Operational Test and Evaluation
Team assess the utility of the TPQ/27 relative to that of the
TPQ/10. In order that the additive model used to combine
unidimensional relative utilities into a system relative
utility be justifiable, it is necessary that the utilities
satisfy certain independence properties described in Keeney
and Raiffa (12).

Because those independence properties are very difficult
for decision makers to verify for complex alternatives
like the weapon systems under study, Professors Barr and
Richards attempted instead to work with the attributes to
generate a new collection which would likely satisfy,
at least approximately, the conditions required to justify
the additive model.

The original collection of 26 attributes is as follows:
1. Portability
2. Durability
3. Time to Set Up
4. Time to Take Down
5. Ease of Assigning Aircraft to Targets
6. Number of Aircraft Controlled
7. Number of Targets
8. Communications
9. Mission Flexibility
10. ASRT Survivability
11. Time to Locate and Acquire Aircraft
12. Accuracy of Tracking
13. Accuracy of Delivery
14. Range
15. Aircraft Vulnerability
16. Aircraft Attack Throughput
17. Ease of Adjustment and Evaluation of Results
18. Accuracy of Feedback
19. Ease of Operation
20. Man-Machine Compatibility
21. Training Requirements
22. Reliability
23. Maintainability
24. Supportability
25. Availability
26. Documentation


where a_i represents attribute i and

I(x, y) = 1 if x = y, and 0 otherwise.

It is easy to verify that D is a metric as defined in
Chapter IV. Since we will actually work with a similarity
measure in the hierarchical cluster procedure, we define
the similarity between two attributes a_i and a_k as

S(a_i, a_k) = \sum_{j=1}^{12} I(x_{ij}, x_{kj})    (46)

One can see from this definition that the similarity between
two attributes a_i and a_k is simply the number of team
members who placed attributes a_i and a_k in the same
partition. For example,


S(a, a5) =O+12+0+12+0+0+4+0+12+0+0+0 
+1+024 


Data Matrix

Either S or D can be used in the computer program
shown in Appendix A for hierarchical clustering. One need
only indicate whether a correlation-like measure (larger
values imply more similar) or a distance-like measure
(smaller values imply more similar) is wanted. We elected to
use the former. The similarity matrix extracted
from the data is shown in Table V-3. We present only the lower
triangular elements since S(a_i, a_i) = 12 for all i and
the matrix is symmetric; i.e., S(a_i, a_k) = S(a_k, a_i).
Zero values are not written.
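
The construction of the similarity and distance measures from the team members' partitions can be sketched on a small example. The data below are hypothetical (4 attributes and 3 team members rather than 26 and 12), and the function names are illustrative.

```python
def same_group(a, b):
    """Indicator: 1 if the two group numbers are equal, else 0."""
    return 1 if a == b else 0

def similarity(x, i, k):
    """Number of team members who put attributes i and k in the same group."""
    return sum(same_group(a, b) for a, b in zip(x[i], x[k]))

def distance(x, i, k):
    """Number of team members who put attributes i and k in different groups."""
    return sum(1 - same_group(a, b) for a, b in zip(x[i], x[k]))

# Hypothetical data: x[i][j] is the group number that team member j
# gave to attribute i (4 attributes, 3 team members).
x = [
    [1, 1, 2],
    [1, 2, 2],
    [2, 2, 1],
    [2, 3, 1],
]
m = len(x[0])
for i in range(1, len(x)):
    print([similarity(x, i, k) for k in range(i)])   # rows of the lower triangle
```

For every pair of attributes the two measures add up to the number of team members, so either one determines the other and the choice between them only decides whether the clustering program searches for maxima or minima.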

The results from the hierarchical clustering are shown
in Figure 4. The numbers printed along the left-hand
margin refer to the attribute numbers. Proceeding to
the right through the tree, one observes numbers greater
than 26. These correspond to the clusterings that take
place from one step to the next. For example, the number
27 shown at the juncture of 25 and 22 means that the first
attributes clustered together are 25 and 22 (this is
the most similar pair). This combination is then considered
as a new attribute which is later combined with the attribute
30 (itself a combination of 23 and 24) to form the attribute
31. This is later combined with attribute 2 to form
attribute 40, etc.

As discussed in Chapter IV, a decision has to be made as
to how many clusters (superattributes) are desired. All
hierarchical methods will continue clustering until there
is a single cluster. In order to decide on the number of
clusters (and their composition) one need only imagine drawing
a vertical line through the tree at various places. Each
intersection of the tree with the vertical line results in a
cluster. For example, the vertical line at the point A
results in the 6 clusters shown in Table V-4.
It is clear from observing the above collection that
some of the attributes are highly correlated and redundant.
If one tries to assign an importance weight to each
attribute separately, there is a distinct likelihood that
some of the strongly overlapping, related attributes
might effectively be double or triple weighted, or more,
producing a biased result. In an effort to prevent this
from happening, Barr and Richards asked the utility
assessment team to partition the 26 attributes into a smaller
collection in such a way that attributes within a group are
similar and attributes in different groups are unrelated, in
the sense that utility assessments for attributes in one group
do not depend on the amounts of attributes in any other
group.

The total number of groups was not prespecified.
Instead, each team member was allowed to partition the 26
attributes into any number of groups. The resulting
multivariate data array is shown in Table V-2. An element x_ij
is the number of the group into which team member j put
attribute i.

Let us define a distance measure for this data array
as follows:

D(a_i, a_k) = \sum_{j=1}^{12} \left[ 1 - I(x_{ij}, x_{kj}) \right]    (45)

Table III. Similarity Matrix for Superattribute Determination

Figure 4. Tree for 26 Attributes
The superattributes used in the utility study are those
shown in Table IV. A careful examination of the attributes
which compose the clusters shows that the results so obtained
are intuitively agreeable. The names supplied to the
superattributes are somewhat natural descriptions of the
clusters obtained.

The listing of the computer program and sample output
are given in Appendix A.
Table IV. Superattributes

Superattributes          Component Attributes

Facility of movement     Portability
                         Time to set up
                         Time to take down

Facility of use          Ease of assigning aircraft to targets
(precision)              Number of aircraft controlled
                         Number of targets
                         Mission flexibility
                         Time to locate and acquire aircraft
                         Aircraft attack throughput
                         Ease of adjustment
                         Accuracy of feedback
                         Ease of operation
                         Man-machine compatibility
                         Accuracy of tracking
                         Accuracy of delivery

                         ASRT survivability
                         Aircraft vulnerability

                         Training requirements
                         Documentation

                         Durability
                         Reliability
                         Maintainability
                         Supportability
                         Availability
VI. ANALYSIS OF ARMY TANK DATA


A. DATA STRUCTURE

In order to illustrate the nonhierarchical clustering
methodology, principal components analysis, and discriminant
analysis, data on Army tanks from eight different countries
were taken from Jane's Book of Weapon Systems (1979-80).
A total of twenty-four tanks were included in the data
array with observations on each of 10 variables. The 10
variables are listed below:

1. Weight (ton)
2. Length (meter)
3. Width (meter)
4. Height (meter)
5. Road Speed (kilometer per hour)
6. Trench Crossing (meter)
7. Ground Pressure (kg/cm²)
8. Maximum Armament (rounds)
9. Ground Clearance (meter)
10. Power to Engine Ratio (BHP/ton)


The twenty-four tanks and the associated countries are
shown below:

Identification Number    Type/Name           Country

11                       T-62                U.S.S.R.
12                       T-54                U.S.S.R.
13                       T-10                U.S.S.R.
14                       ASU-85              U.S.S.R.
15                       MK-5/Chieftain      U.K.
16                       MK-3/Vickers        U.K.
17                       MK-13/Centurion     U.K.
18                       CVR(T)/Scorpion     U.K.
19                       XM-1                U.S.A.
20                       M60A2               U.S.A.
21                       M60                 U.S.A.
22                       M48                 U.S.A.
23                       M47                 U.S.A.
24                       PZ61                Switzerland
25                       PZ68                Switzerland
26                       STRV-103            Sweden
27                       Ikv-91              Sweden
28                       TYPE61              Japan
29                       TYPE74              Japan
30                       Leopard 2           W. Germany
31                       Leopard             W. Germany
32                       TAM                 W. Germany
33                       AMX 30              France
34                       AMX 13              France
We conjecture that a cluster analysis of the tank data will
result in clusters corresponding to nationality, since the
nations may place different emphasis on the variables in
the design of their tanks.


B. NONHIERARCHICAL CLUSTER ANALYSIS OF TANK DATA 
1. The MIKCA Algorithm 
The specific algorithm chosen for the nonhierarchical
cluster analysis of the tank data is the MIKCA (Multivariate
Iterative K-MEANS Clustering Algorithm) program written by
Douglas J. McRae as a part of his doctoral dissertation
at the University of North Carolina, Chapel Hill.

Reference to the flow chart in Figure 5 will aid the
reader in following the discussion of the algorithm. Inputs
to the program are the data matrix, an estimate for g (the
number of clusters), and the choice of criterion and distance
functions.

In the first step, preliminary calculations are made,
such as the variable means and standard deviations, as
well as the cross product matrix T. The next step forms
the initial cluster centers. Then each of the other
observations is assigned to the nearest cluster. Euclidean
distance is used for this initial phase, and the cluster
centroids are recomputed after each observation is assigned
to a group. The observations are considered in the same
order as they were input. After all of them have been
assigned to clusters, the criterion value is computed.
This initial cluster-finding technique is referred to as a
one-pass K-MEANS procedure. It is performed three times,
and the solution which yields the best criterion value is
chosen as the initial cluster solution.

Figure 5. MIKCA Flow Chart
(Boxes: preliminary calculations; initial cluster solution;
K-MEANS; criterion improvement?; individual switches.)

After the initial solution has been found, the program
advances to the iterative K-MEANS phase, where the observations
are again considered in the order in which they were input to
the program. It is in this phase that the user's choice of
distance function is used. The distance from each
observation to each cluster centroid is again computed, this
time with the user's distance function; the observation is
assigned to the closest centroid and the centroid is updated
to reflect its new membership. After considering all N
observations in this manner, the new criterion value is
checked for possible improvement during the K-MEANS iteration.
As long as the criterion value improves, the K-MEANS
procedure is repeated; if the criterion fails to improve, then
the MIKCA algorithm goes to the next step, the individual
switches section. Note the importance of the order of
consideration of the observations. The order is important
because the cluster means are recomputed after each
observation is reassigned.

In the individual switches phase, consideration is given
to moving each observation to every other cluster, the move
being made if and only if an improvement in the value of
the criterion results. An elaborate labelling procedure
provides a unique order in which to consider each observation.
This procedure continues until a complete pass through the
data is made with no changes in cluster membership.
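
The individual switches phase can be sketched as follows for the minimum trace W criterion. This is not the MIKCA code of Appendix B: it considers the observations simply in input order rather than using MIKCA's labelling procedure, it recomputes the criterion from scratch for every trial move, and the data and starting partition are hypothetical.

```python
def trace_w(data, labels, k):
    """Criterion: total within-group sum of squares (trace W)."""
    total = 0.0
    for c in range(k):
        members = [row for row, lab in zip(data, labels) if lab == c]
        if not members:
            continue
        center = [sum(col) / len(members) for col in zip(*members)]
        total += sum((x - m) ** 2 for row in members for x, m in zip(row, center))
    return total

def individual_switches(data, labels, k):
    """Move single observations between clusters while trace W improves."""
    labels = list(labels)
    best = trace_w(data, labels, k)
    changed = True
    while changed:                        # stop after a full pass with no change
        changed = False
        for i in range(len(data)):
            for c in range(k):
                if c == labels[i]:
                    continue
                trial = labels[:i] + [c] + labels[i + 1:]
                value = trace_w(data, trial, k)
                if value < best - 1e-12:  # accept only a strict improvement
                    labels, best, changed = trial, value, True
    return labels, best

# Hypothetical observations; the starting partition misplaces the third one.
data = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [9.0, 9.0], [8.0, 9.0], [9.0, 8.0]]
start = [0, 0, 1, 1, 1, 1]
labels, value = individual_switches(data, start, 2)
print(labels, value)
```

Since every accepted move strictly lowers the criterion and there are only finitely many partitions, the loop must terminate, although the partition it reaches is a local optimum and may depend on the order in which observations are considered.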

The MIKCA algorithm provides the following options for
distance and criterion functions.

Criterion

1. Minimum trace W
2. Minimum determinant W
3. Maximum largest root of |B - \lambda W| = 0
4. Maximum sum of roots of |B - \lambda W| = 0

Distance

1. Euclidean
2. Weighted Euclidean
3. Mahalanobis

A complete computer program is listed in Appendix B.


2. Cluster Results for Tank Data 
For clustering of the tank data we selected the
minimum trace W criterion and the weighted Euclidean
distance function. The algorithm automatically provides
weights for the weighted Euclidean distance function.
The results of the clustering with four clusters are shown
in Table V.

The conjecture of clustering by nationalities is
supported by the results. The three Soviet tanks make up
one cluster, and the two British and four of the United
States tanks were found to be similar. A third cluster
consists of four tanks which are very lightweight. The
final cluster consists of the rest of the tanks, including
tanks of United States allies from West Germany, France,
Sweden, Switzerland and Japan.

A natural question to ask after observing the results
of a cluster analysis is what variables most strongly
influence the clustering that was observed. A clue is
provided by the composition of the cluster containing all of
the lightweight tanks. This suggests that weight is an
important distinguishing feature. This is examined in the
principal components analysis and the discriminant analysis
in the next two sections.


C. PRINCIPAL COMPONENTS ANALYSIS 

The Statistical Package for the Social Sciences (SPSS)
(14) subprogram FACTOR was used for the principal components
analysis. It is designed for both factor analysis and
principal components analysis. The outputs are designed
to be self-explanatory. In this example, the first 5
components account for 90% of the variance and the remaining
components account for only 10% of the variance (Figure 6).
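
The share of variance carried by the leading components can be
computed directly from the eigenvalues of the correlation matrix. The
Python sketch below shows the arithmetic only. The eigenvalues are
hypothetical values chosen for illustration and are not the values
reported in Figure 6, and the function names are assumptions.

```python
def cumulative_variance(eigenvalues):
    """Cumulative proportion of variance, largest eigenvalue first."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running, shares = 0.0, []
    for value in ordered:
        running += value
        shares.append(running / total)
    return shares


def components_needed(eigenvalues, threshold=0.90):
    """Smallest number of components whose cumulative share meets the threshold."""
    for k, share in enumerate(cumulative_variance(eigenvalues), start=1):
        if share >= threshold:
            return k
    return len(eigenvalues)


# Hypothetical eigenvalues for 10 standardized variables (they sum to 10).
example = [4.0, 2.0, 1.5, 1.0, 0.6, 0.4, 0.2, 0.15, 0.1, 0.05]
print(components_needed(example))
```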

The subprogram FACTOR provides a graphical presentation
(Figure 7) for the factors that have been determined by the
orthogonal rotations (in this example, variance maximization
rotation). In reading the graphs, one should be attentive
to the following three features: (1) the relative distance
of a variable from the axis, (2) the direction of a variable
in relation to the axis, and finally (3) the clustering of
variables and their relative position to each other.
Figure 6.

For example, variables 5 (road speed) and 10 (power to
engine ratio) contribute heavily to the first principal
component while variables 1 (weight) and 3 (width)
contribute most strongly to the second principal component.
Variables 2, 4, 6, 7, 8, 9 are not as important. The
weights accorded each variable in the 10 factors (principal
components) are shown in Figure 8. The complete SPSS
program is listed in Appendix C.


D. DISCRIMINANT ANALYSIS 

The SPSS subprogram DISCRIMINANT was used to determine
that function or those functions of the 10 variables that
best discriminate among the four clusters determined by the
cluster analysis above.

The maximum number of discriminant functions to be
derived is either one less than the number of groups or equal
to the number of discriminating variables, whichever is
smaller. This subprogram provides two measures for judging
the importance of discriminant functions. One of these is
the relative percentage of the eigenvalue associated with the
function. It is a measure of the relative importance of the
function. The sum of the eigenvalues is a measure of total
variance existing in the discriminating variables. Since
discriminant functions are derived in order of their
importance, this process can be stopped whenever the relative
percentage is judged to be too small. Of course, there is no
fixed rule for deciding what is too small. In this research,
we arbitrarily selected a significance level of 0.10. The
output shown in Figure 9 suggests that we therefore consider
only the first two discriminant functions.
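
The limit on the number of functions and the relative percentage of
each eigenvalue are simple to compute. The Python sketch below is
illustrative only: the group and variable counts are those of the
text, the eigenvalues are hypothetical and are not the values in
Figure 9, and the function names are assumptions.

```python
def max_functions(n_groups, n_variables):
    """Upper limit on the number of discriminant functions."""
    return min(n_groups - 1, n_variables)


def relative_percentages(eigenvalues):
    """Each eigenvalue as a percentage of the sum of all eigenvalues."""
    total = sum(eigenvalues)
    return [100.0 * value / total for value in eigenvalues]


# Four clusters and ten discriminating variables, as in the text.
limit = max_functions(4, 10)

# Hypothetical eigenvalues, one per discriminant function.
shares = relative_percentages([6.0, 3.0, 1.0])
print(limit, shares)
```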

The second measure for judging the importance of a
discriminant function is its associated canonical correlation.
The canonical correlation is a measure of association between
the single discriminant function and the set of (g-1) dummy
variables which define the g group memberships. It tells
us how closely the function and the group variable are
related, which is just another measure of the function's
ability to discriminate among the groups. From Figure 10,
the first two discriminant functions are each highly
correlated with the groups but the third has only a moderate
correlation.
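
The two measures are linked: the squared canonical correlation of a
function equals its eigenvalue divided by one plus that eigenvalue.
A minimal Python sketch of this standard relation follows. The
function name and the sample eigenvalues are hypothetical and are not
taken from Figure 10.

```python
from math import sqrt


def canonical_correlation(eigenvalue):
    """Canonical correlation implied by a discriminant-function eigenvalue."""
    return sqrt(eigenvalue / (1.0 + eigenvalue))


# Hypothetical eigenvalues: a strong function and a weak one.
strong = canonical_correlation(3.0)
weak = canonical_correlation(0.25)
print(round(strong, 3), round(weak, 3))
```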

The next criterion for eliminating discriminant
functions is to test for the statistical significance of
discriminating information not already accounted for by
the earlier functions. As each function is derived,
starting with no (zero) functions, Wilks' lambda is computed.
Lambda is an inverse measure of the discriminating power in
the original variables which has not yet been removed by the
discriminant functions - the larger lambda is, the less
information remains. Lambda can be transformed into
a chi-square statistic for an easy test of statistical
significance. In Figure 9, Wilks' lambda was .594 after
the first two functions had been derived. This corresponds
to a chi-square of 8.8476 with a probability level of .1823.
This means that a lambda of this magnitude or smaller has a
.1823 probability of occurring by chance in sampling even if
there were no further information to be accounted for by a
third function in the population. Since .1823 exceeds the
chosen significance level of 0.10, a third function is not
statistically significant in this case.
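
The transformation of lambda into chi-square can be sketched in
Python using Bartlett's approximation, a common form of this test.
This is a sketch under stated assumptions, not the SPSS code. The
values 24 cases, 4 groups and 2 derived functions come from the text.
The count of 8 discriminating variables is purely an assumption made
for the example, chosen because it leaves six degrees of freedom. The
function names are hypothetical.

```python
from math import exp, lgamma, log


def wilks_chi_square(wilks_lambda, n_cases, n_variables, n_groups, n_derived):
    """Bartlett's chi-square approximation for residual Wilks' lambda.

    Returns the statistic and its degrees of freedom after n_derived
    discriminant functions have been removed.
    """
    multiplier = n_cases - 1 - (n_variables + n_groups) / 2.0
    chi_square = -multiplier * log(wilks_lambda)
    df = (n_variables - n_derived) * (n_groups - n_derived - 1)
    return chi_square, df


def chi_square_sf(x, df):
    """Upper-tail probability of a chi-square variate.

    Uses the power series for the regularized lower incomplete gamma
    function, which is adequate for moderate x.
    """
    if x <= 0:
        return 1.0
    a, t = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= t / (a + n)
        total += term
        n += 1
    lower = total * exp(-t + a * log(t) - lgamma(a))
    return 1.0 - lower


# n_variables=8 is an assumption; the other inputs are from the text.
chi2, df = wilks_chi_square(0.594, n_cases=24, n_variables=8,
                            n_groups=4, n_derived=2)
p_value = chi_square_sf(chi2, df)
print(round(chi2, 3), df, round(p_value, 3))
```

Under these assumed inputs the sketch gives a statistic and
probability close to the reported values. The small difference is
what one would expect from lambda being rounded to three places.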

The standardized discriminant function coefficients
corresponding to the values of the Vij's discussed in the
previous section are used to compute the discriminant score
for a case (observation) in which the original discriminating
variables are in standard form. The discriminant score is
computed by multiplying each discriminating variable by its
corresponding coefficient and adding together these products.
There is a separate score for each observation on each
function. The coefficients have been derived in such a way
that the discriminant scores produced are in standard form.
When the sign is ignored, each standardized discriminant
function coefficient represents the relative contribution of
its associated variable to that function. The sign merely
denotes whether the variable is making a positive or negative
contribution.
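
The score computation described above can be sketched in a few lines
of Python. The data and coefficients are hypothetical, the function
names are assumptions, and each variable is standardized here with
its overall mean and sample standard deviation, which is a
simplifying assumption and may differ from the standardization SPSS
applies.

```python
from math import sqrt


def standardize(column):
    """Convert one variable to standard form (mean 0, sample s.d. 1)."""
    n = len(column)
    mean = sum(column) / n
    sd = sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]


def discriminant_scores(data, coefficients):
    """Score of every observation on one discriminant function.

    Each standardized variable is multiplied by its coefficient and
    the products are added together.
    """
    columns = [standardize(list(col)) for col in zip(*data)]
    return [sum(c * z for c, z in zip(coefficients, row))
            for row in zip(*columns)]


# Hypothetical data: three observations on two variables.
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scores = discriminant_scores(data, [1.0, 0.0])
print(scores)
```

There is one such list of scores for each discriminant function, so
two functions give each observation a pair of coordinates.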

A graphical presentation is shown in Figure 11 using
the first and the second canonical discriminant functions as
the axes. From this scatterplot, we can easily see that
Soviet tanks (labelled 1) are well distinguished from
all of the others using only the first two discriminant
functions. Also, all the lightweight tanks are clearly
separated from the others. The distinction between groups
2 and 3 is also clear, though they are not separated from
each other as much as from groups 1 and 4. The complete SPSS
program for the discriminant analysis is listed in Appendix D.
VII. CONCLUSION


The multivariate analysis techniques of cluster analysis,
principal components analysis and discriminant analysis are
useful in real world problems for examining observations
on each of several dimensions. Each of the techniques is
related mathematically to the others, and each complements
the others in explaining the data.

Computer software is readily available from many sources.
The software used in this thesis for hierarchical clustering,
principal components analysis, and discriminant analysis was
from the IMSL package and SPSS. For nonhierarchical
clustering, we used the FORTRAN program developed by McRae
(16). All of this software is readily available and
documented at the Naval Postgraduate School.